To solve the challenge we used a combination of tools. To preprocess the data,
we relied on a small PHP script. To visualize the network data we used Pajek, a
popular network analysis program [http://pajek.imfm.si/doku.php].
Although Pajek offers a lot of functionality, we used only a small part of it,
mainly the force-directed layout algorithms, the degree filter, measures like
centrality, and walks with limited length. Besides these tools, we developed a
small Java tool to help us analyze the network data according to the given
constraints on the network structure.
Video:
ANSWERS:
MC2.1:
Which of the two social structures, A or B, most closely match the scenario you
have identified in the data?
A
MC2.2:
Provide the social network structure you have identified as a tab-delimited file.
It should contain the employee, one or more handler, any middle folks, and the
localized leader with their international contacts. What are the Flitter names
of the persons involved? Please identify only key connections (not all single
links for example) as well as any other nodes related to the scenario (if any)
you may have discovered that were not described in the two scenarios A and B
above.
MC2.3:
Characterize the difference between your social network and the closest social
structure you selected (A or B). If you include extra nodes please explain how
they fit in to your scenario or analysis.
1. Approach
Figure 1 – The pipeline we used for our analysis: first, data selection and
aggregation; after that, an iterative visualization approach.
2. Selection and Preprocessing
We started our analysis by getting familiar with the data and writing down the
constraints for every scenario, dividing them into those we judged necessary,
possible, and merely speculative. The data was inserted into a MySQL database
using Navicat Lite. Then we built an aggregated table with all the connection
information that was given, e.g. the exact geo location on the map and the
connection count of each user, using a small PHP script that took us about one
hour to write. The connection data itself was loaded into Pajek using the
txt2pajek helper tool.
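The aggregation and export step can be illustrated with a short Java sketch.
It is not the PHP script or the txt2pajek tool mentioned above; it only shows
the idea under the assumption that the links are available as an undirected
edge list in which every link appears once. The class and method names are
hypothetical, and the user IDs in the example are toy data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PajekExport {

    /** Counts how many links touch each user ID in an undirected edge list. */
    static Map<Integer, Integer> connectionCounts(List<int[]> edges) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (int[] e : edges) {
            counts.merge(e[0], 1, Integer::sum);
            counts.merge(e[1], 1, Integer::sum);
        }
        return counts;
    }

    /** Renders the edge list in Pajek's .net format (vertices numbered from 1). */
    static String toPajekNet(List<int[]> edges) {
        // Pajek expects consecutive vertex numbers, so map each user ID to an index
        // in order of first appearance and keep the original ID as the label.
        Map<Integer, Integer> index = new LinkedHashMap<>();
        for (int[] e : edges) {
            index.putIfAbsent(e[0], index.size() + 1);
            index.putIfAbsent(e[1], index.size() + 1);
        }
        StringBuilder sb = new StringBuilder();
        sb.append("*Vertices ").append(index.size()).append('\n');
        for (Map.Entry<Integer, Integer> v : index.entrySet()) {
            sb.append(v.getValue()).append(" \"").append(v.getKey()).append("\"\n");
        }
        sb.append("*Edges\n");
        for (int[] e : edges) {
            sb.append(index.get(e[0])).append(' ').append(index.get(e[1])).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Toy edge list, not the challenge data.
        List<int[]> edges = new ArrayList<>();
        edges.add(new int[] {100, 194});
        edges.add(new int[] {100, 261});
        edges.add(new int[] {194, 4994});
        System.out.println(connectionCounts(edges));
        System.out.print(toPajekNet(edges));
    }
}
```

Keeping the original user ID as the vertex label makes it possible to trace a
vertex in the Pajek view back to the database row.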
We initially visualized the complete graph using Pajek’s force-directed layout
algorithms and started to reduce the network using a Pajek partition in which
vertices are colored according to their connection count. The result was still
a very cluttered view, so we decided to apply more constraints to get rid of
irrelevant information.
To do that we first defined four classes (employee, handler, middleman and
fearless leader) and assigned the persons to the classes according to their
connection counts. Based on these classes we added further constraints, first
with SQL statements; later we developed a lightweight Java tool to structure
the process of adding constraints.
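A class assignment by connection count could look roughly like the sketch
below. It is an illustration, not the tool described here. The middleman range
(4-5 contacts) and the leader threshold (over 100 contacts) are the values
named for scenario A in this report; the employee and handler ranges are
placeholders chosen only to make the example run, and all names are
hypothetical. A user may qualify for more than one class, so the method
returns a set.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

final class RoleClassifier {

    enum Role { EMPLOYEE, HANDLER, MIDDLEMAN, LEADER }

    /** Inclusive range of connection counts that qualifies a user for a role. */
    static final class Range {
        final int min;
        final int max;

        Range(int min, int max) {
            this.min = min;
            this.max = max;
        }

        boolean contains(int n) {
            return n >= min && n <= max;
        }
    }

    static final Map<Role, Range> RANGES = new EnumMap<>(Role.class);

    static {
        // PLACEHOLDER ranges: not stated in this report, adjust to the task description.
        RANGES.put(Role.EMPLOYEE, new Range(35, 45));
        RANGES.put(Role.HANDLER, new Range(25, 45));
        // Values named for scenario A in this report.
        RANGES.put(Role.MIDDLEMAN, new Range(4, 5));
        RANGES.put(Role.LEADER, new Range(101, Integer.MAX_VALUE));
    }

    /** Returns every role whose range contains the given connection count. */
    static Set<Role> candidateRoles(int connectionCount) {
        Set<Role> roles = EnumSet.noneOf(Role.class);
        for (Map.Entry<Role, Range> e : RANGES.entrySet()) {
            if (e.getValue().contains(connectionCount)) {
                roles.add(e.getKey());
            }
        }
        return roles;
    }

    public static void main(String[] args) {
        for (int count : new int[] {5, 40, 150, 10}) {
            System.out.println(count + " contacts -> " + candidateRoles(count));
        }
    }
}
```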
3. Visual Analytics Approach
At first the analysis was led by the idea of concentrating on the scenario with
more information available and simpler constraints, which is clearly scenario
A. Scenario B did not appear to be directly supported by the data, considering
the fixed values for the connection counts of the middlemen (which would be 2-3
contacts). The only possibility under scenario B was that the middlemen have
contact with more than one of the handlers.
We concentrated on scenario A first and used the given constraints to reduce
the dataset. The critical point was to check which users of the class employee
had connections to at least 3 persons of the class handler, and whether all of
those handlers had contact with someone with 4-5 contacts. That person,
codenamed Boris, also had to have contact with the fearless leader, who has a
connection count of over 100.
We wrote our Java tool in an iterative process which took us about 6 hours. In
each step of the process we added a new constraint and then visualized the
results with the help of Pajek. Some constraints, e.g. that the handlers are
not allowed to communicate among themselves, were not included because they
could easily be checked in the visualization. As a result we got exactly one
network that matched the given constraints of scenario A.
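The scenario A constraints described above can be expressed as a small graph
search. The following Java sketch is one possible reading of them and not the
original tool: it assumes the network is held as an adjacency map and that the
candidate sets for each class have already been computed from the connection
counts. All class and method names are hypothetical, and the sample graph is
toy data built only to exercise the code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ScenarioAMatcher {

    /** One employee, its handlers, their shared middleman and that middleman's leader. */
    static final class Match {
        final int employee;
        final Set<Integer> handlers;
        final int middleman;
        final int leader;

        Match(int employee, Set<Integer> handlers, int middleman, int leader) {
            this.employee = employee;
            this.handlers = handlers;
            this.middleman = middleman;
            this.leader = leader;
        }

        @Override
        public String toString() {
            return "employee=" + employee + " handlers=" + handlers
                    + " middleman=" + middleman + " leader=" + leader;
        }
    }

    /** Adds an undirected link between two users. */
    static void link(Map<Integer, Set<Integer>> adj, int a, int b) {
        adj.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        adj.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    static Set<Integer> neighbors(Map<Integer, Set<Integer>> adj, int id) {
        return adj.getOrDefault(id, Collections.<Integer>emptySet());
    }

    static Set<Integer> intersect(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    /** Finds employees with at least 3 handlers that share a middleman linked to a leader. */
    static List<Match> find(Map<Integer, Set<Integer>> adj, Set<Integer> employees,
                            Set<Integer> handlers, Set<Integer> middlemen,
                            Set<Integer> leaders) {
        List<Match> matches = new ArrayList<>();
        for (int e : employees) {
            Set<Integer> myHandlers = intersect(neighbors(adj, e), handlers);
            if (myHandlers.size() < 3) {
                continue; // constraint: at least 3 handler contacts
            }
            for (int m : middlemen) {
                Set<Integer> shared = intersect(myHandlers, neighbors(adj, m));
                if (shared.size() < 3) {
                    continue; // constraint: the handlers share this middleman
                }
                for (int l : intersect(neighbors(adj, m), leaders)) {
                    matches.add(new Match(e, shared, m, l));
                }
            }
        }
        return matches;
    }

    /** Toy graph: one complete scenario A structure plus an incomplete decoy. */
    static Map<Integer, Set<Integer>> sampleGraph() {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        int[][] links = {{100, 194}, {100, 261}, {100, 563}, {194, 4994},
                         {261, 4994}, {563, 4994}, {4994, 4}, {19, 194}, {19, 261}};
        for (int[] l : links) {
            link(adj, l[0], l[1]);
        }
        return adj;
    }

    public static void main(String[] args) {
        List<Match> found = find(sampleGraph(),
                new HashSet<>(Arrays.asList(100, 19)),
                new HashSet<>(Arrays.asList(194, 261, 563)),
                new HashSet<>(Arrays.asList(4994)),
                new HashSet<>(Arrays.asList(4)));
        for (Match m : found) {
            System.out.println(m);
        }
    }
}
```

Like the process described above, the sketch leaves out the rule that handlers
must not communicate among themselves; that check could be added as one more
filter on the handler set.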
The next step was to add tool support for scenario B. We checked again which
users of the class employee had connections to at least 3 persons of the class
handler. But this time it was possible that each handler has his own middleman
with 2-4 contacts. These middlemen had to have contact with one potential
leader. In the end we saw no evidence in the data that scenario B would match.
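The scenario B test is slightly harder to automate, because each handler needs
a middleman of his own. The Java sketch below is one way to check this and is
not the original tool: for a given employee and potential leader it collects,
per handler, the middlemen that also reach the leader, and then uses simple
backtracking to find the largest number of handlers that can be paired with
different middlemen. The adjacency map, the candidate sets and all names are
assumptions made for the example, and the graph in main is toy data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ScenarioBCheck {

    /** Adds an undirected link between two users. */
    static void link(Map<Integer, Set<Integer>> adj, int a, int b) {
        adj.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        adj.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    static Set<Integer> neighbors(Map<Integer, Set<Integer>> adj, int id) {
        return adj.getOrDefault(id, Collections.<Integer>emptySet());
    }

    /**
     * Largest number of handlers (from position i on) that can each be paired
     * with a middleman nobody else uses. Plain backtracking, fine for small inputs.
     */
    static int maxDistinct(List<Set<Integer>> options, int i, Set<Integer> used) {
        if (i == options.size()) {
            return 0;
        }
        int best = maxDistinct(options, i + 1, used); // leave handler i unpaired
        for (int m : options.get(i)) {
            if (used.add(m)) {
                best = Math.max(best, 1 + maxDistinct(options, i + 1, used));
                used.remove(m);
            }
        }
        return best;
    }

    /** True if at least 3 handlers of the employee reach the leader via their own middleman. */
    static boolean matches(Map<Integer, Set<Integer>> adj, int employee,
                           Set<Integer> handlers, Set<Integer> middlemen, int leader) {
        List<Set<Integer>> options = new ArrayList<>();
        for (int h : neighbors(adj, employee)) {
            if (!handlers.contains(h)) {
                continue;
            }
            // Middlemen that are in contact with both this handler and the leader.
            Set<Integer> viaLeader = new TreeSet<>(neighbors(adj, h));
            viaLeader.retainAll(middlemen);
            viaLeader.retainAll(neighbors(adj, leader));
            options.add(viaLeader);
        }
        return maxDistinct(options, 0, new HashSet<Integer>()) >= 3;
    }

    public static void main(String[] args) {
        // Toy graph: employee 1, handlers 10-12, middlemen 20-22, leader 30.
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        int[][] links = {{1, 10}, {1, 11}, {1, 12}, {10, 20}, {11, 21},
                         {12, 22}, {20, 30}, {21, 30}, {22, 30}};
        for (int[] l : links) {
            link(adj, l[0], l[1]);
        }
        Set<Integer> handlers = new HashSet<>(Arrays.asList(10, 11, 12));
        Set<Integer> middlemen = new HashSet<>(Arrays.asList(20, 21, 22));
        System.out.println(matches(adj, 1, handlers, middlemen, 30));
    }
}
```

Requiring different middlemen is what separates this check from scenario A,
where all handlers share a single middleman.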
By mapping the network structure onto the map of Flovania, we realized that the
fearless leader didn’t live in a larger city. Because this geospatial
implication was mentioned in the task description, we decided to validate the
result again.
In order to do that, we used SQL statements and visualizations. We started
again by looking at which employees have connections to at least 3 handlers.
This left only 13 potential employees. Then we queried for the connections to
potential handlers, middlemen and leaders and visualized the result set for
each potential employee separately.
In figure 2 you can see the visualization for the employee with the ID 19. This
network structure nearly matches the constraints of scenario B. You can see
four handlers connected to one employee. The drawback is that there are only
two handlers whose two middlemen have contact with one leader. In our analysis
we found no matching structure for scenario B at all.
Figure 2 – Network structure of employee with ID 19
By visualizing the network for the employee with ID 100 (figure 3) it is easy
to see that it fits the network structure of scenario A. There is one employee
connected to 3 handlers, who are all connected to one middleman, who in turn is
connected to the leader. This is the only matching structure we found in the
data. This mainly manual analysis of the 13 employees took us about 2 hours.
Figure 3 – Network structure of employee with ID 100
4. Result
To visualize our final result we took the detected employee, the three
handlers, the middleman and the fearless leader and queried for all connections
between these persons. We also added all international contacts of the fearless
leader and the contact of the middleman Boris to a not yet mentioned member of
the organization. In figure 4 you can see our final network, which seems to be
the best match for the task.
Figure 4 – The complete resulting network of the criminal organization.
We believe that the person whose ID is 100 is the employee and the persons with
the IDs 194, 261 and 563 are his handlers. As the three handlers have contact
with only one person from the group of persons with 4 or 5 contacts, this
person has to be the middleman Boris. Boris has the ID 4994. Boris in turn has
only one contact in the group of persons with over 100 contacts: the person
whose ID is 4. This person seems to be the Fearless Leader. We obtained all of
these IDs with the help of our self-written tool. Furthermore, we found one
person who is linked with Boris, so it is very probable that the person whose
ID is 1612 is also a member of the organization.
MC2.4:
How is your hypothesis about the social structure in Part 1 supported by the city
locations of Flovania? What part(s), if any, did the role of geographical
information play in the social network of part one?
Looking at the geo data, we can’t confirm the suggestion that the leader lives
in one of the bigger cities. Nevertheless, we think that our result is clearly
supported by the locations of the members. In figure 5 you can see that the
employee lives in the 2nd largest city, just like his direct contacts. As we
expect that the embassy is located in a bigger city (either Koul or Prounov),
the residence of the employee matches this scenario. Boris lives in a mid-sized
city and the leader lives geographically separated from the rest of the
network.
Because there weren’t many points to plot on the map, we did this manually with
a drawing program. This took us about half an hour.
MC2.5:
In general, how are the Flitter users dispersed throughout the cities of this
challenge? Which of the surrounding countries may have ties to this criminal
operation? Why might some be of more significant
concern than others?
Generally the Flitter users are dispersed throughout the cities in proportion
to the number of inhabitants. But because we didn’t have exact information
about the number of inhabitants, we can’t be really sure that this is correct.
We analyzed this with a simple SQL statement.
As you can see in figure 5, Posana may have stronger ties to the organization,
because 50 percent of all international contacts live in Otello. But because
the numbers of contacts to Posana, Trium and Transak differ by only 3 or 4
contacts, we think that there is no significant concentration of contacts in
Posana.
Figure 5 – Geo-mapping of the suspect network